This workshop is designed to provide beginners with a foundational understanding of the R programming language. Through a combination of theoretical explanations, hands-on coding exercises, and practical applications, participants will learn how to leverage R for data visualization of cancer biology datasets.
The workshop will cover essential programming concepts and gradually introduce more advanced topics, with a focus on using the ggplot2 package suite for data visualization. The aim of this workshop is to analyse data and create informative plots.
Participants will gain the following skills:
readr package.
ggplot2 package.
Before starting this course you will need to ensure that your computer is set up with the required software. If you have any difficulty installing any of this software then please contact one of the trainers for help.
R and RStudio are separate downloads and installations.
R is the underlying statistical computing environment. The base R system and a very large collection of packages that give you access to a huge range of statistical and analytical functionality are available from CRAN, the Comprehensive R Archive Network.
However, using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive.
You need to install R before you install RStudio.
If you already have R and RStudio installed:
sessionInfo(), which will also display which version of R you are running. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.
If you don’t have R and RStudio installed:
.exe file that was just downloaded
If you already have R and RStudio installed:
sessionInfo(), which will also display which version of R you are running. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it.
If you don’t have R and RStudio installed:
.pkg file for the latest R version
sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach as the versions provided by this are usually out of date. In any case, make sure you have at least R 4.3.2.
sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
On this course we will be making use of a brilliant collection of packages designed for data science called the tidyverse that make it much easier and more fun to work with your data. After installing R and RStudio, follow the instructions below to install the tidyverse package suite.
install.packages("tidyverse") (look for the ‘Console’ tab and type at the > prompt)
The Metabric study characterized the genomic mutations and gene expression profiles for 2509 primary breast tumours. In addition to the gene expression data generated using microarrays, genome-wide copy number profiles were obtained using SNP microarrays. Targeted sequencing was performed for 2509 primary breast tumours, along with 548 matched normals, using a panel of 173 of the most frequently mutated breast cancer genes as part of the Metabric study.
References:
Both the clinical data and the gene expression values were downloaded from cBioPortal.
We excluded observations for patient tumour samples lacking expression data, resulting in a data set with fewer rows.
R is a powerful programming language and open-source software widely used for statistical computing and data analysis. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R has gained popularity among statisticians, data scientists, researchers, and analysts for its flexibility, extensibility, and robust statistical capabilities.
Here are several compelling reasons to consider learning R:
To begin working with R, users typically install an Integrated Development Environment (IDE) such as RStudio, which provides a user-friendly interface for coding, debugging, and visualizing results. R scripts are written in the R language and can be executed interactively or saved for later use.
Open RStudio. You will see four windows (aka panes). Each window has a different function. The screenshot below shows an analogy linking the different RStudio windows to cooking.
On the left-hand side, you’ll find the console. This is where you can input commands (code that R can interpret), and the responses to your commands, known as output, are displayed here. While the console is handy for experimenting with code, it doesn’t save any of your entered commands. Therefore, relying exclusively on the console is not recommended.
The history pane (located in the top right window) maintains a record of the commands that you have executed in the R console during your current R session. This includes both correct and incorrect commands.
You can navigate through your command history using the up and down arrow keys in the console. This allows you to quickly recall and re-run previous commands without retyping them.
The environment pane (located in the top right window) provides an overview of the objects (variables, data frames, etc.) that currently exist in your R session. It displays the names, types, dimensions, and some content of these objects. This allows you to monitor the state of your workspace in real-time.
The plotting pane (located in the bottom right window) is where graphical output, such as plots and charts, is displayed when you create visualizations in R. The Plotting pane often includes tools for zooming, panning, and exporting plots, providing additional functionality for exploring and customizing your visualizations.
The help pane (located in the bottom right window) is a valuable resource for accessing documentation and information about R functions, packages, and commands. When you type a function or command in the console and press the F1 key (Mac: fn + F1) the Help pane displays relevant documentation. Additionally, you can type a keyword in the text box at the top right corner of the Help Pane.
The files pane provides a file browser and file management interface within RStudio. It allows you to navigate through your project directories, view files, and manage your file system.
The packages pane provides a user-friendly interface for managing R packages. It lists installed packages and allows you to load, unload, update, and install packages.
The viewer pane is used to display dynamic content generated by R, such as HTML, Shiny applications, or interactive visualizations.
Opening an RStudio session launches it from a specific location. This is the working directory. R looks in the working directory by default to read in data and save files. You can find out what the working directory is by using the command getwd(). This shows you the path to your working directory in the console. On macOS this is in the format /path/to/working/directory and on Windows C:\path\to\working\directory. It is often useful to have your data and R scripts in the same directory and set this as your working directory. We will do this now.
Make a folder for this course somewhere on your computer that you will be able to easily find. Name the folder, for example, Intro_R_course. Then, to set this folder as your working directory:
In RStudio click on the Files tab and then click on the three dots, as shown below.
In the window that appears, find the folder you created (e.g. Intro_R_course), click on it, then click Open. The files tab will now show the contents of your new folder. Click on More → Set As Working Directory, as shown below.
Note: You can use an RStudio project as described here to automatically keep track of and set the working directory.
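The same steps can also be done in code. The sketch below prints the current working directory, switches to a course folder, and switches back. The folder is created inside a temporary directory purely for illustration; on your own machine you would point setwd() at the Intro_R_course folder you made.

```r
# Show where R is currently reading and writing files
original_dir <- getwd()
print(original_dir)

# Hypothetical course folder, created in a temporary location for this demo
course_dir <- file.path(tempdir(), "Intro_R_course")
dir.create(course_dir, showWarnings = FALSE)

# Make it the working directory, then confirm the change
setwd(course_dir)
print(basename(getwd()))

# Return to where we started
setwd(original_dir)
```

Using setwd() in a script ties it to one computer's folder layout, which is one reason an RStudio project is often the more portable choice.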
In RStudio, the Script pane (located at the top left window) serves as a dedicated space for writing, editing, and executing Quarto documents. It is where you compose and organize your R code, making it an essential area for creating reproducible and well-documented analyses.
RStudio provides syntax highlighting in the Script pane, making it easier to identify different components of your code. You can execute individual lines or selections of code from the Script pane. This helps in testing and debugging code without running the entire document.
Navigate to File → New File → Quarto Document; a new pane will open in the top-left corner.
Add a title (e.g. IntroR), your name as Author, and save this document as ‘IntroR-doc.qmd’ in your current working directory (e.g. Intro_R_course).
Executing commands or running code is the process of submitting a command to your computer, which does some computation and returns an answer. In RStudio, there are several ways to execute commands:
We suggest the third option, which is fastest. This link provides a list of useful RStudio keyboard shortcuts that can be beneficial when coding and navigating the RStudio IDE.
When you type in, and then run the commands shown in the grey boxes below, you should see the result in the Console pane at bottom left.
We can use R as a calculator to do simple maths.
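For instance, typing the following expressions and running them returns each result in the console. These are ordinary arithmetic operators built into R; the particular numbers are arbitrary.

```r
# Basic arithmetic at the console
3 + 5        # addition
10 - 4       # subtraction
6 * 7        # multiplication
20 / 8       # division
2^10         # exponentiation
(3 + 5) * 2  # parentheses control the order of operations
```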
More complex calculator functions are built in to R, which is the reason it is popular among mathematicians and statisticians. To use a function, we call it by writing its name followed by its arguments in parentheses.
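As a small illustration of calling built-in functions, the sketch below uses sqrt(), log(), round() and mean(), all of which ship with base R. The object names values and average are made up for this example.

```r
# Call a function by writing its name followed by arguments in parentheses
sqrt(16)                    # square root
log(100, base = 10)         # logarithm with an explicitly named argument
round(3.14159, digits = 2)  # round to two decimal places

# Results can be stored in an object and reused
values <- c(2, 4, 6, 8)
average <- mean(values)
average
```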
In R, the ? and ?? operators are used for accessing help documentation, but they behave slightly differently.
The ? operator is used to access help documentation for a specific function or topic. When you type ? followed by the name of a function, you get detailed information about that function. For example, try ?mean:
| mean | R Documentation |
Generic function for the (trimmed) arithmetic mean.
mean(x, ...)
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
x: An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for
trim: the fraction (0 to 0.5) of observations to be trimmed from each end of
na.rm: a logical evaluating to
…: further arguments passed to or from other methods.
If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one. If x is not logical (coerced to numeric), numeric (including integer) or complex, NA_real_ is returned, with a warning.
If trim is non-zero, a symmetrically trimmed mean is computed with a fraction of trim observations deleted from each end before the mean is computed.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
weighted.mean, mean.POSIXct, colMeans for row and column means.
x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
The above command displays the help documentation for the mean function, providing information about its usage, arguments, and examples.
The ?? operator is used for a broader search across the help documentation. It searches the documentation for the specified term or keyword. For example, ??regression will search for the term “regression” in the help documentation and return relevant results. It’s useful when you want to find functions, packages, or topics related to a specific term.
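Both operators have function equivalents, which can be handy inside scripts. The sketch below stores the results instead of displaying them, so it runs the same way outside an interactive session. Restricting the search to the stats package is a choice made here only to keep it quick, and the object names are made up.

```r
# help("mean") is the function form of ?mean
mean_help <- help("mean")
class(mean_help)

# help.search("regression") is the function form of ??regression;
# restricting it to the stats package keeps the search quick
reg_search <- help.search("regression", package = "stats")
nrow(reg_search$matches)  # number of matching help entries
```

At the console you would normally just type ?mean or ??regression and read the result in the Help pane.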
Tab completion
A very useful feature is Tab completion. You can start typing and use Tab to autocomplete code, for example, a function name.
Many developers have built thousands of functions and shared them with the R user community to help make everyone’s work easier and more efficient. These functions (short programs) are generally packaged up together in (wait for it) Packages. For example, the tidyverse package is a compilation of many different functions, all of which help with data transformation and visualization. Packages also contain data, which is often included to assist new users with learning the available functions.
Packages are hosted on repositories, with CRAN (Comprehensive R Archive Network) being the primary repository. To install packages from CRAN, you use the install.packages() function. For example:
This will spit out a lot of text into the console as the package is being installed. Once complete you should have a message:
The downloaded binary packages are in... followed by a long directory name.
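A common variation is to install a package only if it is not already present. The sketch below is one way to do that, not the course's own code: requireNamespace() and install.packages() are standard R functions, while the names wanted, installed and to_install are made up here. The interactive() check is an extra precaution added for this example so that nothing is downloaded when the script runs unattended.

```r
# Packages this course relies on
wanted <- c("tidyverse")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
to_install <- wanted[!installed]

# Install only what is missing, and only in an interactive session
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
to_install
```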
To remove an installed package, use the remove.packages() function, for example remove.packages("tidyverse").
After installation, you need to load a package into your R session using the library() function. For example:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This makes the functions and datasets from the ‘tidyverse’ package available for use in your current session.
You only need to install a package once. Once installed, you don’t need to reinstall it in subsequent sessions. However, you do need to load the package at the beginning of each R session using the library() function before you can utilize its functions and features. This ensures that the package is actively available for use in your current session.
To view packages currently loaded into memory:
[1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
[13] "grDevices" "utils" "datasets" "methods" "base"
[1] ".GlobalEnv" "package:lubridate" "package:forcats"
[4] "package:stringr" "package:dplyr" "package:purrr"
[7] "package:readr" "package:tidyr" "package:tibble"
[10] "package:ggplot2" "package:tidyverse" "package:stats"
[13] "package:graphics" "package:grDevices" "package:utils"
[16] "package:datasets" "package:methods" "Autoloads"
[19] "package:base"
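The two listings above have the shape of the output of (.packages()) and search(), both part of base R. A minimal sketch follows; the exact packages listed will depend on what you have loaded in your own session.

```r
# Names of the packages currently attached in this session
attached <- (.packages())
attached

# The full search path R uses to look up object names
search()
```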
Each package comes with documentation that explains how to use its functions. You can access this information using the help() function or by using ? before the function name:
| tidyverse-package | R Documentation |
The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://www.tidyverse.org.
Maintainer: Hadley Wickham hadley@rstudio.com
Other contributors:
RStudio [copyright holder, funder]
Useful links:
or by using vignette (if the documentation is in the form of vignettes):
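As a sketch of both routes, the calls below use the stats and grid packages, chosen here only because they ship with every R installation. In the course you would substitute tidyverse or ggplot2, for example vignette(package = "ggplot2"). The results are stored in objects with made-up names so the example also runs outside an interactive session.

```r
# Package-level help index; the same as typing help(package = "stats") at the console
pkg_help <- help(package = "stats")
class(pkg_help)

# List the vignettes shipped with a package
grid_vignettes <- vignette(package = "grid")
grid_vignettes$results[, "Item"]
```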
The ggplot2 package simplifies the creation of plots using data frames. This package offers a streamlined interface for defining variables to plot, configuring their display, and adjusting visual attributes. Consequently, adapting to changes in the data or transitioning between plot types requires only minimal modifications. This feature facilitates the creation of high-quality plots suitable for publication with minimal manual adjustments.
In this section, you’ll learn the basics of reading data files into R using the readr package. We will use the read_csv() function from the readr package to import a dataset. CSV, short for Comma-Separated Values, is a text format commonly used to store tabular data. Conventionally the first line contains column headings.
The first argument of the read_csv() function takes the path to the file (or a web link). The following code will download the metabric dataset.
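To show the mechanics of read_csv() without depending on a download, the sketch below writes a tiny stand-in CSV file to a temporary location and reads it back. The patient IDs and values are invented for illustration and are not Metabric data; only the column names mirror those used later in this course.

```r
library(readr)

# Write a tiny stand-in CSV file; the first line holds the column headings
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Patient_ID,ER_status,GATA3,ESR1",
  "P-001,Positive,9.3,10.1",
  "P-002,Negative,6.9,8.9",
  "P-003,Positive,11.2,10.0"
), csv_path)

# read_csv() takes the path (or URL) as its first argument and returns a tibble
toy <- read_csv(csv_path, show_col_types = FALSE)
toy
```

read_csv() guesses a type for each column, so GATA3 and ESR1 come back as numbers while Patient_ID and ER_status stay as text.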
In the previous section we imported a dataset into a data frame named metabric. This section demonstrates different ways to view this dataset.
When the name of the object (data frame) is typed, the first few lines are displayed along with some information, such as the number of rows:
| Patient_ID | Cohort | Age_at_diagnosis | Survival_time | Survival_status | Vital_status | Chemotherapy | Radiotherapy | Tumour_size | Tumour_stage | Neoplasm_histologic_grade | Lymph_nodes_examined_positive | Lymph_node_status | Cancer_type | ER_status | PR_status | HER2_status | HER2_status_measured_by_SNP6 | PAM50 | 3-gene_classifier | Nottingham_prognostic_index | Cellularity | Integrative_cluster | Mutation_count | ESR1 | ERBB2 | PGR | TP53 | PIK3CA | GATA3 | FOXA1 | MLPH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MB-0000 | 1 | 75.65 | 140.50000 | LIVING | Living | NO | YES | 22 | 2 | 3 | 10 | 3 | Breast Invasive Ductal Carcinoma | Positive | Negative | Negative | NEUTRAL | claudin-low | ER-/HER2- | 6.044 | NA | 4ER+ | NA | 8.929817 | 9.333972 | 5.680501 | 6.338739 | 5.704157 | 6.932146 | 7.953794 | 9.729728 |
| MB-0002 | 1 | 43.19 | 84.63333 | LIVING | Living | NO | YES | 10 | 1 | 3 | 0 | 1 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumA | ER+/HER2- High Prolif | 4.020 | High | 4ER+ | 2 | 10.047059 | 9.729606 | 7.505424 | 6.192507 | 5.757727 | 11.251197 | 11.843989 | 12.536570 |
| MB-0005 | 1 | 48.87 | 163.70000 | DECEASED | Died of Disease | YES | NO | 15 | 2 | 2 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | NA | 4.030 | High | 3 | 2 | 10.041281 | 9.725825 | 7.376123 | 6.404516 | 6.751566 | 9.289758 | 11.698169 | 10.306115 |
| MB-0006 | 1 | 47.68 | 164.93333 | LIVING | Living | YES | YES | 25 | 2 | 2 | 3 | 2 | Breast Mixed Ductal and Lobular Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | NA | 4.050 | Moderate | 9 | 1 | 10.404685 | 10.334979 | 6.815637 | 6.869241 | 7.219187 | 8.667723 | 11.863379 | 10.472181 |
| MB-0008 | 1 | 76.97 | 41.36667 | DECEASED | Died of Disease | YES | YES | 40 | 2 | 3 | 8 | 3 | Breast Mixed Ductal and Lobular Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | ER+/HER2- High Prolif | 6.080 | High | 9 | 2 | 11.276581 | 9.956267 | 7.331223 | 6.337951 | 5.817818 | 9.719781 | 11.625006 | 12.161961 |
| MB-0010 | 1 | 78.77 | 7.80000 | DECEASED | Died of Disease | NO | YES | 31 | 4 | 3 | 0 | 1 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | ER+/HER2- High Prolif | 4.062 | Moderate | 7 | 4 | 11.239750 | 9.739996 | 5.954311 | 5.419711 | 6.123056 | 9.787085 | 12.142178 | 11.433164 |
| MB-0014 | 1 | 56.45 | 164.33333 | LIVING | Living | YES | YES | 10 | 2 | 2 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | LOSS | LumB | NA | 4.020 | Moderate | 3 | 4 | 10.793832 | 9.276507 | 7.720952 | 5.992706 | 7.481835 | 8.365527 | 11.482627 | 10.755199 |
| MB-0022 | 1 | 89.08 | 99.53333 | DECEASED | Died of Other Causes | NO | YES | 29 | 2 | 2 | 1 | 2 | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | Negative | NEUTRAL | claudin-low | NA | 4.058 | Moderate | 3 | 1 | 10.440667 | 8.613192 | 5.592522 | 6.165420 | 7.593330 | 7.872962 | 10.679403 | 9.945023 |
| MB-0028 | 1 | 86.41 | 36.56667 | DECEASED | Died of Other Causes | NO | YES | 16 | 2 | 3 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Negative | Negative | GAIN | LumB | ER+/HER2- High Prolif | 5.032 | Moderate | 9 | 4 | 12.521038 | 10.678266 | 5.325554 | 6.220372 | 6.250678 | 10.260059 | 12.148375 | 10.936002 |
| MB-0035 | 1 | 84.22 | 36.26667 | DECEASED | Died of Disease | NO | NO | 28 | 2 | 2 | 0 | 1 | Breast Invasive Lobular Carcinoma | Positive | Negative | Negative | LOSS | Her2 | ER+/HER2- High Prolif | 3.056 | High | 3 | 5 | 7.536847 | 11.514514 | 5.587666 | 6.411477 | 5.988243 | 10.212611 | 12.804542 | 13.474571 |
The dim() function prints the dimensions (rows x columns) of the data frame:
[1] 1904 32
This information is also available in the Environment pane (top right) as the number of observations (rows) and variables (columns).
The nrow() function prints the number of rows while ncol() prints the number of columns:
[1] 1904
[1] 32
The View() function gives a spreadsheet-like view of the data frame:
Clicking the object in the Environment tab also gives a spreadsheet-like view of the object:
The head() function prints the top 6 rows of a data frame:
| Patient_ID | Cohort | Age_at_diagnosis | Survival_time | Survival_status | Vital_status | Chemotherapy | Radiotherapy | Tumour_size | Tumour_stage | Neoplasm_histologic_grade | Lymph_nodes_examined_positive | Lymph_node_status | Cancer_type | ER_status | PR_status | HER2_status | HER2_status_measured_by_SNP6 | PAM50 | 3-gene_classifier | Nottingham_prognostic_index | Cellularity | Integrative_cluster | Mutation_count | ESR1 | ERBB2 | PGR | TP53 | PIK3CA | GATA3 | FOXA1 | MLPH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MB-0000 | 1 | 75.65 | 140.50000 | LIVING | Living | NO | YES | 22 | 2 | 3 | 10 | 3 | Breast Invasive Ductal Carcinoma | Positive | Negative | Negative | NEUTRAL | claudin-low | ER-/HER2- | 6.044 | NA | 4ER+ | NA | 8.929817 | 9.333972 | 5.680501 | 6.338739 | 5.704157 | 6.932146 | 7.953794 | 9.729728 |
| MB-0002 | 1 | 43.19 | 84.63333 | LIVING | Living | NO | YES | 10 | 1 | 3 | 0 | 1 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumA | ER+/HER2- High Prolif | 4.020 | High | 4ER+ | 2 | 10.047059 | 9.729606 | 7.505424 | 6.192507 | 5.757727 | 11.251197 | 11.843989 | 12.536570 |
| MB-0005 | 1 | 48.87 | 163.70000 | DECEASED | Died of Disease | YES | NO | 15 | 2 | 2 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | NA | 4.030 | High | 3 | 2 | 10.041281 | 9.725825 | 7.376123 | 6.404516 | 6.751566 | 9.289758 | 11.698169 | 10.306115 |
| MB-0006 | 1 | 47.68 | 164.93333 | LIVING | Living | YES | YES | 25 | 2 | 2 | 3 | 2 | Breast Mixed Ductal and Lobular Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | NA | 4.050 | Moderate | 9 | 1 | 10.404685 | 10.334979 | 6.815637 | 6.869241 | 7.219187 | 8.667723 | 11.863379 | 10.472181 |
| MB-0008 | 1 | 76.97 | 41.36667 | DECEASED | Died of Disease | YES | YES | 40 | 2 | 3 | 8 | 3 | Breast Mixed Ductal and Lobular Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | ER+/HER2- High Prolif | 6.080 | High | 9 | 2 | 11.276581 | 9.956267 | 7.331223 | 6.337951 | 5.817818 | 9.719781 | 11.625006 | 12.161961 |
| MB-0010 | 1 | 78.77 | 7.80000 | DECEASED | Died of Disease | NO | YES | 31 | 4 | 3 | 0 | 1 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | ER+/HER2- High Prolif | 4.062 | Moderate | 7 | 4 | 11.239750 | 9.739996 | 5.954311 | 5.419711 | 6.123056 | 9.787085 | 12.142178 | 11.433164 |
Similarly, the tail() function prints the bottom 6 rows of the data frame:
| Patient_ID | Cohort | Age_at_diagnosis | Survival_time | Survival_status | Vital_status | Chemotherapy | Radiotherapy | Tumour_size | Tumour_stage | Neoplasm_histologic_grade | Lymph_nodes_examined_positive | Lymph_node_status | Cancer_type | ER_status | PR_status | HER2_status | HER2_status_measured_by_SNP6 | PAM50 | 3-gene_classifier | Nottingham_prognostic_index | Cellularity | Integrative_cluster | Mutation_count | ESR1 | ERBB2 | PGR | TP53 | PIK3CA | GATA3 | FOXA1 | MLPH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MB-7294 | 4 | 59.20 | 82.73333 | DECEASED | Died of Disease | NO | NO | 15 | NA | 2 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | GAIN | LumB | ER+/HER2- High Prolif | 4.03 | High | 1 | 2 | 11.290976 | 10.846545 | 7.312247 | 5.660943 | 6.190000 | 9.424235 | 11.07569 | 11.567166 |
| MB-7295 | 4 | 43.10 | 196.86667 | LIVING | Living | NO | YES | 25 | NA | 3 | 1 | 2 | Breast Invasive Lobular Carcinoma | Positive | Positive | Negative | NEUTRAL | LumA | ER+/HER2- Low Prolif | 5.05 | High | 3 | 4 | 9.591235 | 9.935178 | 7.984515 | 6.753291 | 6.279207 | 9.207323 | 11.28119 | 11.337601 |
| MB-7296 | 4 | 42.88 | 44.73333 | DECEASED | Died of Disease | NO | YES | 20 | NA | 3 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Negative | Positive | GAIN | LumB | NA | 5.04 | High | 5 | 6 | 9.733986 | 13.753037 | 5.616082 | 6.271912 | 5.999093 | 9.530390 | 11.53203 | 11.626140 |
| MB-7297 | 4 | 62.90 | 175.96667 | DECEASED | Died of Disease | NO | YES | 25 | NA | 3 | 45 | 3 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | NA | 6.05 | High | 1 | 4 | 11.053198 | 10.228570 | 7.478069 | 6.212256 | 6.192399 | 9.540589 | 11.48276 | 11.180360 |
| MB-7298 | 4 | 61.16 | 86.23333 | DECEASED | Died of Other Causes | NO | NO | 25 | NA | 2 | 12 | 3 | Breast Invasive Ductal Carcinoma | Positive | Positive | Negative | NEUTRAL | LumB | ER+/HER2- High Prolif | 5.05 | Moderate | 1 | 15 | 11.055114 | 9.892589 | 8.282737 | 6.466712 | 6.287254 | 10.365901 | 11.37118 | 12.827069 |
| MB-7299 | 4 | 60.02 | 201.90000 | DECEASED | Died of Other Causes | NO | YES | 20 | NA | 3 | 1 | 2 | Breast Invasive Ductal Carcinoma | Positive | Negative | Negative | NEUTRAL | LumB | ER+/HER2- High Prolif | 5.04 | High | 10 | 3 | 10.696475 | 10.227787 | 5.533486 | 6.180511 | 6.208784 | 9.749368 | 10.86753 | 9.847856 |
The colnames() function displays all the column names:
[1] "Patient_ID" "Cohort"
[3] "Age_at_diagnosis" "Survival_time"
[5] "Survival_status" "Vital_status"
[7] "Chemotherapy" "Radiotherapy"
[9] "Tumour_size" "Tumour_stage"
[11] "Neoplasm_histologic_grade" "Lymph_nodes_examined_positive"
[13] "Lymph_node_status" "Cancer_type"
[15] "ER_status" "PR_status"
[17] "HER2_status" "HER2_status_measured_by_SNP6"
[19] "PAM50" "3-gene_classifier"
[21] "Nottingham_prognostic_index" "Cellularity"
[23] "Integrative_cluster" "Mutation_count"
[25] "ESR1" "ERBB2"
[27] "PGR" "TP53"
[29] "PIK3CA" "GATA3"
[31] "FOXA1" "MLPH"
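The inspection functions described in this section can be tried on any data frame. The sketch below uses a small invented data frame called toy as a stand-in for metabric. View() is left out because it needs the RStudio interface.

```r
# A small stand-in data frame (the real metabric table has 1904 rows and 32 columns)
toy <- data.frame(
  Patient_ID = sprintf("P-%03d", 1:8),
  GATA3 = c(6.9, 11.3, 9.3, 8.7, 9.7, 9.8, 8.4, 7.9),
  ESR1 = c(8.9, 10.0, 10.0, 10.4, 11.3, 11.2, 10.8, 10.4)
)

dim(toy)          # rows and columns together
nrow(toy)         # number of rows
ncol(toy)         # number of columns
head(toy)         # first 6 rows
tail(toy, n = 3)  # last 3 rows; n changes how many are shown
colnames(toy)     # column names
```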
The construction of ggplot graphics is incremental, allowing for the addition of new elements in layers. This approach grants users extensive flexibility and customization options, enabling the creation of tailored plots to suit specific needs.
To build a ggplot, the following basic template can be used for different types of plots.
Three things are required for a ggplot:
We first specify the data frame that contains the relevant data to create a plot. Here we are sending the metabric dataset to the ggplot() function.
This command results in an empty gray panel. We must specify how various columns of the data frame should be depicted in the plot.
aes()
Next, we specify the columns in the data we want to map to visual properties (called aesthetics or aes in ggplot2). e.g. the columns for x values, y values and colours.
Since we are interested in generating a scatter plot, each point will have an x and a y coordinate. Therefore, we need to specify the x-axis to represent the transcription factor (GATA3) and y-axis to represent the estrogen receptor alpha (ESR1).
This results in a plot which includes the grid lines, the variables and the scales for x and y axes. However, the plot is empty or lacks data points.
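The first two steps can be sketched as follows. The toy data frame is randomly generated and only borrows the column names GATA3 and ESR1 from the text; with the course data you would pass metabric instead.

```r
library(ggplot2)

# Stand-in data with the same column names used in the text
set.seed(1)
toy <- data.frame(GATA3 = rnorm(50, mean = 9), ESR1 = rnorm(50, mean = 10))

# Step 1: data only, which draws an empty panel
p_data <- ggplot(data = toy)

# Step 2: add aesthetic mappings, which adds axes and scales but no points yet
p_mapped <- ggplot(data = toy, mapping = aes(x = GATA3, y = ESR1))

length(p_mapped$layers)  # number of geom layers added so far
```

Storing a plot in an object, as done here, does not draw it; typing the object's name on its own line does.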
geom_()
Finally, we specify the type of plot (the geom). There are different types of geoms:
The range of geoms available in ggplot2 can be obtained by navigating to the ggplot2 package in the Packages tab pane in RStudio (bottom right-hand corner) and scrolling down the list of functions sorted alphabetically to the geom_... functions.
Since we are interested in creating a scatter plot, the geometric representation of the data will be in point form. Therefore we use the geom_point() function.
To plot the expression of estrogen receptor alpha (ESR1) against that of the transcription factor, GATA3:
Notice that we use the + sign to add a layer of points to the plot. This concept bears resemblance to Adobe Photoshop, where layers of images can be rearranged and edited independently. In ggplot, each layer is added over the plot in accordance with its position in the code using the + sign.
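A minimal sketch of the complete three-part recipe (data, aesthetic mappings, geom) is shown below. It uses randomly generated stand-in data rather than the metabric data frame, so the pattern of points is not meaningful.

```r
library(ggplot2)

# Stand-in data: ESR1 is simulated to rise with GATA3
set.seed(1)
toy <- data.frame(GATA3 = rnorm(50, mean = 9))
toy$ESR1 <- toy$GATA3 + rnorm(50, sd = 0.5)

# Data, aesthetic mappings, then a layer of points added with +
scatter <- ggplot(data = toy, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point()
scatter
```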
The above plot could be made more informative. For instance, the additional information regarding the ER status (i.e., ER_status column) could be incorporated into the plot. To do this, we can utilize aes() and specify which column in the metabric data frame should be represented as the color of the points.
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = ER_status))
Notice that we specify the colour = ER_status argument in the aes() mapping inside the geom_() function instead of the ggplot() function.
To colour points based on a continuous variable, for example the Nottingham prognostic index (NPI):
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = Nottingham_prognostic_index))
In ggplot2, a continuous colour gradient is used for continuous variables, while discrete or categorical values are represented using distinct colours.
Note that some patient samples lack expression values, leading ggplot2 to remove those points with missing values for ESR1 and GATA3.
Let’s add shape to points.
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(shape = `3-gene_classifier`))
Warning: Removed 204 rows containing missing values or values outside the scale range
(`geom_point()`).
Note that some patient samples have not been classified and ggplot has removed those points with missing values for the three-gene classifier.
The shape argument allows you to customize the appearance of all data points by assigning an integer associated with predefined shapes shown below:
To use asterisks instead of points in the plot:
It would be useful to be able to change the shape of all the points. We can do so by setting the shape to a single value rather than mapping it to one of the variables in the data set. This has to be done outside the aesthetic mappings (i.e. outside the aes() bit).
Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters (outside aes()). We map an aesthetic to a variable (e.g., aes(shape = `3-gene_classifier`)) or set it to a constant (e.g., shape = 8). If you want appearance to be governed by a variable in your data frame, put the specification inside aes(); if you want to override the default shape, size or colour, put the value outside of aes().
# shape set outside aes()
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(shape = 8)
# shape mapped inside aes()
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(aes(shape = `3-gene_classifier`))
Warning: Removed 204 rows containing missing values or values outside the scale range
(`geom_point()`).
The above plots are created with similar code, but have rather different outputs. The first plot sets the shape to a single value and the second plot maps (not sets) the shape to the three-gene classifier variable.
It is usually preferable to use colours to distinguish between different categories but sometimes colour and shape are used together when we want to show which group a data point belongs to in two different categorical variables.
We can adjust the size and/or transparency of the points.
Let’s first increase the size of points.
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = PAM50), size = 2)
Note that here we add the size argument outside of the aesthetic mapping.
Transparency can be useful when we have a large number of points as we can more easily tell when points are overlaid, but like size, it is not usually mapped to a variable and sits outside the aes().
Let’s change the transparency of points.
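One way to do this is with the alpha argument of geom_point(), where 0 is fully transparent and 1 is fully opaque. The sketch below uses randomly generated stand-in data with the column names from the text; the value 0.5 and the object name p_alpha are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in data with a categorical column to colour by
set.seed(1)
toy <- data.frame(
  GATA3 = rnorm(200, mean = 9),
  ESR1 = rnorm(200, mean = 10),
  PAM50 = sample(c("LumA", "LumB", "Her2"), 200, replace = TRUE)
)

# colour is mapped to a variable inside aes(); alpha is set to a constant outside it
p_alpha <- ggplot(data = toy, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = PAM50), alpha = 0.5)
p_alpha
```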
We can add another layer to this plot using a different geometric representation (or geom_ function) we discussed previously.
Let’s add trend lines to this plot using the geom_smooth() function, which provides a summary of the data.
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Note that the shaded area surrounding the blue line represents the confidence interval around the fitted curve (95% by default).
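A sketch of a scatter plot with a smoothed trend line is shown below, using randomly generated stand-in data. Because this toy data set has fewer than 1000 rows, geom_smooth() will report that it used the loess method rather than gam.

```r
library(ggplot2)

# Stand-in data: ESR1 is simulated to rise with GATA3
set.seed(1)
toy <- data.frame(GATA3 = runif(100, min = 6, max = 12))
toy$ESR1 <- 0.8 * toy$GATA3 + rnorm(100)

# Both layers share the x and y mappings given in ggplot()
p_smooth <- ggplot(data = toy, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point() +
  geom_smooth()  # se = TRUE by default, which draws the shaded band
p_smooth
```

Setting se = FALSE inside geom_smooth() removes the shaded band if only the line is wanted.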
Let’s make the plot look a bit prettier by reducing the size of the points and making them transparent. We’re not mapping size or alpha to any variables, just setting them to constant values, and we only want these settings to apply to the points, so we set them inside geom_point().
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Let’s add some colour to the plot.
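One way to do this, sketched here under the assumption that colour should reflect ER_status, is to put the colour mapping inside geom_point() only. The points are then coloured by group while geom_smooth() still fits a single trend line to all of the data. The stand-in data frame and its category labels are simulated.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(
    GATA3 = rnorm(200, mean = 9, sd = 1.5),
    ER_status = rep(c("Negative", "Positive"), each = 100)
  )
  metabric$ESR1 <- metabric$GATA3 + rnorm(200)
}

# colour is mapped in the point layer only, so the smooth layer is not split by group
p_colour <- ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
  geom_smooth()
p_colour
```

Moving colour = ER_status into the aes() of ggplot() instead would apply it to every layer and produce one trend line per group.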
By default, ggplot uses the column names specified inside aes() as the axis labels. We can change these using the x = and y = arguments of the labs() function.
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
geom_smooth() +
labs(x = "GATA3 Expression",
y = "ESR1 Expression")
You can also add a title, a subtitle, a caption or a tag.
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
geom_smooth() +
labs(
title = "Expression of estrogen receptor alpha against the transcription factor",
subtitle = "ESR1 vs GATA3",
caption = "This is a caption",
tag = "Figure 1",
x = "GATA3 Expression",
y = "ESR1 Expression")
Themes control the overall appearance of the plot, including background colour, grid lines, axis labels, and text styles. ggplot offers several built-in themes, and you can also create custom themes to match your preferences or the requirements of your publication. The default theme has a grey background.
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
geom_smooth() +
theme_bw()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Try these themes yourselves: theme_classic(), theme_dark(), theme_grey() (default), theme_light(), theme_linedraw(), theme_minimal(), theme_void() and theme_test().
The metabric study redefined how we think about breast cancer by identifying and characterizing several new subtypes, referred to as integrative clusters. Let’s create a bar chart of the number of patients whose cancers fall within each subtype in the metabric cohort.
geom_bar() is the geom used to plot bar charts. It requires a single aesthetic mapping of the categorical variable of interest to x.
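A minimal sketch of such a bar chart, assuming the subtype is stored in the Integrative_cluster column. The stand-in data frame and its cluster labels are made up so that the code runs without the real data.

```r
library(ggplot2)

# Simulated stand-in (hypothetical cluster labels) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  metabric <- data.frame(
    Integrative_cluster = rep(c("1", "2", "3", "4", "5"), times = c(60, 20, 50, 40, 30))
  )
}

# geom_bar() counts the rows in each category, so only x needs to be mapped
p_bar <- ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Integrative_cluster))
p_bar
```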
The dark grey bars are a bit ugly - what if we want each bar to be a different colour?
Colouring the edges wasn’t quite what we had in mind. Look at the help for geom_bar to see what other aesthetic we should have used.
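The difference between the two aesthetics can be sketched as follows: colour affects only the outline of each bar, whereas fill colours the inside. The stand-in data frame and its cluster labels are made up.

```r
library(ggplot2)

# Simulated stand-in (hypothetical cluster labels) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  metabric <- data.frame(
    Integrative_cluster = rep(c("1", "2", "3", "4", "5"), times = c(60, 20, 50, 40, 30))
  )
}

# colour: only the bar outlines change, the bars themselves stay grey
p_bar_edges <- ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Integrative_cluster, colour = Integrative_cluster))

# fill: the inside of each bar takes a different colour
p_bar_fill <- ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Integrative_cluster, fill = Integrative_cluster))
p_bar_fill
```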
Box plots (or box & whisker plots) are a particular favourite seen in many seminars and papers. Box plots summarize the distribution of a set of values by displaying the median (i.e. middle-ranked value), the range of the middle 50% of values (inter-quartile range, IQR), and the spread of the remaining values. In geom_boxplot(), the whiskers extending above and below the IQR box reach the most extreme values that lie within Q3 + (1.5 x IQR) and Q1 - (1.5 x IQR) respectively; values beyond the whiskers are drawn as individual points.
To create a box plot from the metabric dataset:
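A minimal sketch, assuming we want to compare ESR1 expression between the ER_status groups; the choice of these two columns is an assumption, and any categorical x with a numeric y works the same way. The stand-in data is simulated.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(
    ER_status = rep(c("Negative", "Positive"), each = 100),
    ESR1 = c(rnorm(100, mean = 7, sd = 1), rnorm(100, mean = 11, sd = 1))
  )
}

# One box per category on the x-axis, summarising the numeric y variable
p_box <- ggplot(data = metabric) +
  geom_boxplot(mapping = aes(x = ER_status, y = ESR1))
p_box
```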
Let’s try a colour aesthetic to also look at how estrogen receptor expression differs between HER2 positive and negative tumours.
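A sketch of adding the colour aesthetic, assuming the HER2 status is stored in a column called HER2_status (this column name is not shown elsewhere in this section and is an assumption). Each ER_status group is then split into one box per HER2 status, drawn side by side.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values and column names) so the sketch runs
# on its own; skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(
    ER_status = rep(c("Negative", "Positive"), each = 100),
    HER2_status = rep(c("Negative", "Positive"), times = 100),
    ESR1 = c(rnorm(100, mean = 7, sd = 1), rnorm(100, mean = 11, sd = 1))
  )
}

# colour splits each x category into separate boxes, one per HER2 status
p_box_colour <- ggplot(data = metabric) +
  geom_boxplot(mapping = aes(x = ER_status, y = ESR1, colour = HER2_status))
p_box_colour
```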
A violin plot is used to visualize the distribution of a numeric variable across different categories. It combines aspects of a box plot and a kernel density plot.
The width of the violin at any given point represents the density of data at that point. Wider sections indicate a higher density of data points, while narrower sections indicate lower density. By default, violin plots are symmetric.
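A violin plot uses the same mappings as a box plot, with geom_violin() in place of geom_boxplot(). The sketch below assumes ESR1 expression is compared across ER_status groups and runs on simulated stand-in data.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(
    ER_status = rep(c("Negative", "Positive"), each = 100),
    ESR1 = c(rnorm(100, mean = 7, sd = 1), rnorm(100, mean = 11, sd = 1))
  )
}

# One violin per category; its width at each y value reflects the estimated density
p_violin <- ggplot(data = metabric) +
  geom_violin(mapping = aes(x = ER_status, y = ESR1))
p_violin
```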
The geom for creating histograms is, rather unsurprisingly, geom_histogram().
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The message suggests choosing a more suitable bin size by specifying the binwidth argument.
Or we can set the number of bins.
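The three variants can be sketched as follows, using ESR1 as the variable to plot and a bin width of 0.5 and 20 bins as purely illustrative values. The stand-in data is simulated.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(ESR1 = rnorm(200, mean = 9, sd = 2))
}

# Default: 30 bins, which triggers the stat_bin() message
p_hist_default <- ggplot(data = metabric) +
  geom_histogram(mapping = aes(x = ESR1))

# Fix the width of each bin, in the units of the x variable
p_hist_binwidth <- ggplot(data = metabric) +
  geom_histogram(mapping = aes(x = ESR1), binwidth = 0.5)

# Or fix the number of bins instead
p_hist_bins <- ggplot(data = metabric) +
  geom_histogram(mapping = aes(x = ESR1), bins = 20)
p_hist_bins
```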
These histograms are not very pleasing, aesthetically speaking - how about some better aesthetics?
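A sketch of one possible styling, with colour setting the bar outlines and fill setting the bar interiors. The colour names are arbitrary choices, and both are constants, so they sit outside aes(). The stand-in data is simulated.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(ESR1 = rnorm(200, mean = 9, sd = 2))
}

# colour and fill are set to fixed values rather than mapped to variables
p_hist_styled <- ggplot(data = metabric) +
  geom_histogram(
    mapping = aes(x = ESR1),
    binwidth = 0.5,
    colour = "darkblue",
    fill = "skyblue"
  )
p_hist_styled
```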
Use ggsave() to save the last plot you displayed.
You can alter the width and height of the plot and can change the image file type.
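A minimal sketch of saving a plot. The file name, dimensions, and the use of a temporary directory are assumptions for illustration. The plot is passed explicitly through the plot argument here; if it is omitted, ggsave() saves the last plot displayed.

```r
library(ggplot2)

# Simulated stand-in (hypothetical values) so the sketch runs on its own;
# skipped if the real metabric data frame is already loaded.
if (!exists("metabric")) {
  set.seed(1)
  metabric <- data.frame(GATA3 = rnorm(200, mean = 9, sd = 1.5))
  metabric$ESR1 <- metabric$GATA3 + rnorm(200)
}

p_to_save <- ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point()

# The file type is taken from the extension (.pdf here; .png or .svg also work
# where the corresponding graphics device is available)
out_file <- file.path(tempdir(), "gata3_vs_esr1.pdf")
ggsave(out_file, plot = p_to_save, width = 20, height = 12, units = "cm")
```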
You are required to:
metabric
Integrative_cluster column and y-axis plots ESR1 column.
labs(x = ?, y = ?) method to replace ? with correct x and y labels.
This content was adapted from the Introduction to R: exploring the tidyverse course materials.
Comments
In R, any text following the hash symbol # is termed a comment. R disregards this text, considering it non-executable. Comments serve the purpose of documenting your code, aiding your future understanding of specific lines, and highlighting the intentions or challenges encountered.
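A short illustration of the three common ways comments appear in a script; the numbers are made up.

```r
# A comment on its own line describes the code that follows
expression_values <- c(9.2, 10.5, 11.1)

mean_expression <- mean(expression_values)  # a comment can also follow code on the same line

# A line of code that is commented out is never run:
# mean_expression <- 0

mean_expression
```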
RStudio makes it easy to comment or uncomment code: select the lines you want to comment (to comment a set of lines) or place the cursor anywhere on a line (to comment a single line), then press ⌘ + Shift + C (Mac) or Ctrl + Shift + C (Windows/Linux).
Extensive use of comments is encouraged throughout this course.